Load the necessary libraries
library(car) #for regression diagnostics
library(broom) #for tidy output
library(broom.mixed)
library(ggfortify) #for model diagnostics
library(sjPlot) #for outputs
library(knitr) #for kable
library(effects) #for partial effects plots
library(emmeans) #for estimating marginal means
library(ggeffects) #for partial effects plots
library(MASS) #for glm.nb
library(MuMIn) #for AICc
library(nlme)
library(lme4) #for lmer
library(lmerTest) #for satterthwaite p-values with lmer
library(performance) #for residuals diagnostics
library(see) #for plotting residuals
#library(pbkrtest) #for kenward-roger p-values with lmer
library(glmmTMB) #for glmmTMB
library(DHARMa) #for residuals and diagnostics
library(tidyverse) #for data wrangling
theme_set(theme_classic())
A plant pathologist wanted to examine the effects of two different strengths of tobacco virus on the number of lesions on tobacco leaves. She knew from pilot studies that leaves were inherently very variable in response to the virus. In an attempt to account for this leaf to leaf variability, both treatments were applied to each leaf. Eight individual leaves were divided in half, with half of each leaf inoculated with weak strength virus and the other half inoculated with strong virus. So the leaves were blocks and each treatment was represented once in each block. A completely randomised design would have had 16 leaves, with 8 whole leaves randomly allocated to each treatment.
Tobacco plant
Sampling design
Format of tobacco.csv data files
| leaf | treat | nlegion |
|---|---|---|
| 1 | Strong | 35.898 |
| 1 | Weak | 25.02 |
| 2 | Strong | 34.118 |
| 2 | Weak | 23.167 |
| 3 | Strong | 35.702 |
| 3 | Weak | 24.122 |
| ... | ... | ... |
| Variable | Description |
|---|---|
| leaf | The blocking factor - Factor B |
| treat | Categorical representation of the strength of the tobacco virus - the main factor of interest, Factor A |
| nlegion | Number of lesions on that part of the tobacco leaf - response variable |
tobacco <- read_csv('../data/tobacco.csv', trim_ws=TRUE) %>%
janitor::clean_names() %>%
rename(nlegion = number, treat = treatment) %>%
mutate(leaf = factor(leaf), treat = fct_rev(treat))
glimpse(tobacco)
## Rows: 16
## Columns: 3
## $ leaf <fct> L1, L1, L2, L2, L3, L3, L4, L4, L5, L5, L6, L6, L7, L7, L8, L8
## $ treat <fct> Strong, Weak, Strong, Weak, Strong, Weak, Strong, Weak, Stron…
## $ nlegion <dbl> 35.89776, 25.01984, 34.11786, 23.16740, 35.70215, 24.12191, 2…
Model formula: \[ y_i \sim{} \mathcal{N}(\mu_i, \sigma^2)\\ \mu_i =\boldsymbol{\beta} \bf{X_i} + \boldsymbol{\gamma} \bf{Z_i} \]
where \(\boldsymbol{\beta}\) and \(\boldsymbol{\gamma}\) are vectors of the fixed and random effects parameters respectively and \(\bf{X}\) is the model matrix representing the overall intercept and effects of the treatment on the number of lesions. \(\bf{Z}\) represents a cell means model matrix for the random intercepts associated with leaves.
We will start by explicitly declaring the categorical variable (treat) as a factor. In addition, random effects (in this case leaf) should also be declared as factors.
# already done above
# tobacco = tobacco %>%
# mutate(leaf=factor(leaf),
# treat=factor(treat))
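To make the connection between the model formula and the matrices \(\bf{X}\) and \(\bf{Z}\) concrete, here is a small base-R sketch. It uses a mock two-leaf data frame (an illustration only, not the real data): `X` is the fixed-effects matrix (intercept plus treatment contrast) and `Z` is the cell-means (no-intercept) matrix of leaf indicators used for the random intercepts.

```r
# Mock data: 2 leaves x 2 treatments (illustration only, not the real tobacco data)
toy <- data.frame(
  leaf  = factor(rep(c("L1", "L2"), each = 2)),
  treat = factor(rep(c("Weak", "Strong"), times = 2),
                 levels = c("Weak", "Strong"))
)

# X: overall intercept + treatment effect (default treatment contrasts)
X <- model.matrix(~ treat, data = toy)

# Z: cell-means matrix of leaf indicators (one column per leaf, no intercept)
Z <- model.matrix(~ leaf - 1, data = toy)

X
Z
```

Each row of `Z` has exactly one 1, picking out which leaf's random intercept applies to that observation.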
To explore the assumptions of homogeneity of variance and normality, a boxplot of each Treatment level is appropriate.
ggplot(tobacco, aes(y=nlegion, x=treat)) +
geom_boxplot() +
labs(x = "Treatment", y = "Lesion count")
Conclusions: the boxplots suggest we can get away with a Gaussian distribution, despite the two outliers in the Strong treatment.
It can also be useful to get a sense of the consistency across blocks (leaf). That is, do all leaves have a similar baseline level of lesion susceptibility and do they respond similarly to the treatment.
ggplot(tobacco, aes(y=nlegion, x=as.numeric(leaf))) +
geom_line(aes(linetype=treat))
## If we want to retain the original leaf labels
ggplot(tobacco, aes(y=nlegion, x=as.numeric(leaf))) +
geom_blank(aes(x=leaf)) +
geom_line(aes(linetype=treat))
Conclusions:
Given that there are only two levels of Treatment (Strong and Weak), it might be easier to visualise the differences in baselines and effect consistency by plotting as:
ggplot(tobacco, aes(y=nlegion, x=treat, group=leaf)) +
geom_point() +
geom_line(aes(x=as.numeric(treat)))
‘group’ works similarly to ‘by’; group is more versatile when it comes to mapping because it automatically groups by polygon, whereas by needs a named list as input. Otherwise, they work very similarly.
Conclusions:
The above figure also serves as a good way to visualise certain aspects of mixed effects models. When we fit a mixed effects model that includes a random blocking effect (in this case leaf), we are indicating that we are allowing there to be a different intercept for each block (leaf). In the current case, the intercept will represent the first Treatment level (Strong). So the random effect is specifying that the intercept can vary from Leaf to Leaf.
We can think of the model as having two tiers (a hierarchy), where the tiers of the hierarchy represent progressively smaller (typically) spatial scales. In the current example, the largest spatial units are the leaves (blocking factor). Within the leaves, there are the two Treatments (Strong and Weak) and within the Treatments are the individual observations.
Note: we can allow for this, and the model will tell us whether it is necessary. If all the lines were parallel, we would only need random intercepts.
We tend to represent this hierarchy upside down in the model formula:
\[ y_i \sim{} \mathcal{N}(\mu_i, \sigma^2)\\ \mu_i =\beta_0 + \boldsymbol{\beta} \bf{X_i}\\ \beta_0 = \boldsymbol{\gamma} \bf{Z_i} \]
In addition to allowing a different intercept per leaf, we could allow a different magnitude of effect (difference between the Strong and Weak Treatments) per leaf. That is considered a random slope. From the figure above, there is some evidence that the effects (slopes) may vary from leaf to leaf.
Incorporating a random slope (in addition to a random intercept), may reduce the amount of unexplained variance and thus improve the power of the main effect (Treatment effect).
The default glmmTMB optimizer (nlminb), used to find the maximum likelihood or restricted maximum likelihood, is designed for speed but does not always converge. With small data sets such as this, we may need a different optimizer to locate the maximum of the likelihood.
Random intercept model:
tobacco_lmm1 <- glmmTMB(nlegion ~ treat + (1|leaf), data=tobacco,
REML=T)
(1|leaf) is interpreted as ‘do the intercept [‘1’] by [‘|’] leaf as a random effect’
REML (residual or restricted maximum likelihood) is required when comparing models that differ in their random effects structure (but share the same fixed effects); use ordinary maximum likelihood (REML=FALSE) when comparing different fixed effects structures.
Random slope and intercept model:
tobacco_lmm2 <- glmmTMB(nlegion ~ treat + ((1 + treat)|leaf),
data=tobacco, REML=T); AICc(tobacco_lmm2)
## [1] 110.0038
# equivalent formulation:
tobacco_lmm2 <- glmmTMB(nlegion ~ treat + (treat|leaf),
data=tobacco, REML=T); AICc(tobacco_lmm2)
## [1] 110.0038
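AICc adds a small-sample correction to AIC: \(AICc = AIC + 2k(k+1)/(n-k-1)\). We can reproduce the value reported above using the AIC of tobacco_lmm2 (the numbers are taken from the model output; k = 6 counts the 2 fixed effects, 3 random-effect (co)variance terms, and the residual variance):

```r
# AICc = AIC + 2k(k+1) / (n - k - 1)
# For tobacco_lmm2: AIC = 100.6705, k = 6 parameters
# (2 fixed effects + 3 random-effect (co)variance terms + residual),
# n = 16 observations
aic <- 100.6705
k   <- 6
n   <- 16
aicc <- aic + 2 * k * (k + 1) / (n - k - 1)
aicc  # ~110.0, matching the AICc reported above
```

The correction (here about 9.3) grows quickly as k approaches n, which is why AICc penalises the more complex random-effects structures heavily with only 16 observations.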
((1 + treat)|leaf), also written as (1 + treat|leaf) or (treat|leaf), is interpreted as ‘do the intercept [‘1’] and slope [’treat’] by [‘|’] leaf as random effects’
If the model fails to converge, we can try a different optimization algorithm. Here is how to change it (note that optimizer takes a function and optArgs a list):
better_opt <- glmmTMBControl(optimizer = optim,
                             optArgs = list(method = "Nelder-Mead"))
tobacco_lmm2 <- glmmTMB(nlegion ~ treat + (treat|leaf),
data=tobacco, REML=T,
control = better_opt)
tobacco_lmm2
## Formula: nlegion ~ treat + (treat | leaf)
## Data: tobacco
## AIC BIC logLik df.resid
## 100.67050 105.30603 -44.33525 12
## Random-effects (co)variances:
##
## Conditional model:
## Groups Name Std.Dev. Corr
## leaf (Intercept) 3.9701
## treatStrong 0.8194 -0.65
## Residual 3.7587
##
## Number of obs: 16 / Conditional model: leaf, 8
##
## Dispersion estimate for gaussian family (sigma^2): 14.1
##
## Fixed Effects:
##
## Conditional model:
## (Intercept) treatStrong
## 27.061 7.879
Nelder-Mead is quite robust, but much slower. If there are still convergence problems, there is likely an issue with the model we are fitting (too complex for the data available), rather than with the optimizer!
Random slope model:
tobacco_lmm3 <- glmmTMB(nlegion ~ treat + ((-1 + treat)|leaf),
data=tobacco, REML=T,
control = better_opt)
## Warning in fitTMB(TMBStruc): Model convergence problem; non-positive-definite
## Hessian matrix. See vignette('troubleshooting')
AICc(tobacco_lmm1, tobacco_lmm2, tobacco_lmm3)
The random intercept model (tobacco_lmm1) has the lowest AICc, so it is preferred over the random slope and intercept model.
plot_model(tobacco_lmm1, type = 'diag')[-2] %>% plot_grid()
plot_model(tobacco_lmm1, type = 'eff')
## $treat
ggemmeans(tobacco_lmm1, ~treat) %>% plot(add.data=T)
summary(tobacco_lmm1)
## Family: gaussian ( identity )
## Formula: nlegion ~ treat + (1 | leaf)
## Data: tobacco
##
## AIC BIC logLik deviance df.resid
## 96.7 99.8 -44.4 88.7 14
##
## Random effects:
##
## Conditional model:
## Groups Name Variance Std.Dev.
## leaf (Intercept) 13.63 3.692
## Residual 14.46 3.803
## Number of obs: 16, groups: leaf, 8
##
## Dispersion estimate for gaussian family (sigma^2): 14.5
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 27.061 1.874 14.440 < 2e-16 ***
## treatStrong 7.879 1.901 4.144 3.42e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
To change the covariance matrix to a correlation matrix:
cov2cor(vcov(tobacco_lmm1)$cond)
## (Intercept) treatStrong
## (Intercept) 1.0000000 -0.5073298
## treatStrong -0.5073298 1.0000000
Now we can see that the correlation between the intercept (Weak) and treatStrong estimates is -0.51.
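cov2cor() simply divides each covariance by the product of the corresponding standard deviations. A base-R sketch on a small made-up covariance matrix (the numbers are purely illustrative):

```r
# A small example covariance matrix (made-up numbers for illustration)
V <- matrix(c( 4.0, -1.2,
              -1.2,  9.0), nrow = 2)

# Manual conversion: cor_ij = cov_ij / (sd_i * sd_j)
s <- sqrt(diag(V))
R_manual <- V / outer(s, s)

all.equal(R_manual, cov2cor(V))  # TRUE
```

Here the standard deviations are 2 and 3, so the off-diagonal correlation is -1.2 / (2 * 3) = -0.2.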
To get confidence intervals:
tidy(tobacco_lmm1, conf.int=T)
# for just fixed effects:
tidy(tobacco_lmm1, effects = 'fixed', conf.int=T)
# for just random effects:
tidy(tobacco_lmm1, effects = 'ran_pars', conf.int=T)
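The fixed-effect confidence intervals reported by tidy() are Wald intervals, which we can reproduce by hand. A sketch using the treatStrong estimate and standard error from the summary above:

```r
# Estimate and SE for treatStrong, taken from summary(tobacco_lmm1)
est <- 7.879
se  <- 1.901

# Approximate 95% Wald CI: estimate +/- z_{0.975} * SE
ci <- est + c(-1, 1) * qnorm(0.975) * se
round(ci, 2)  # roughly (4.15, 11.60)
```

So the Strong treatment is estimated to produce about 4 to 12 more lesions per half-leaf than the Weak treatment.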
To get how different each intercept is:
(leaf_re <- tobacco_lmm1 %>% ranef %>% unlist)
## cond.leaf.(Intercept)1 cond.leaf.(Intercept)2 cond.leaf.(Intercept)3
## -0.3539126 -1.5406161 -0.7111775
## cond.leaf.(Intercept)4 cond.leaf.(Intercept)5 cond.leaf.(Intercept)6
## -2.5604485 -1.2399299 3.5023104
## cond.leaf.(Intercept)7 cond.leaf.(Intercept)8
## 5.6216063 -2.7178321
mean(leaf_re) # essentially zero
## [1] -2.081668e-17
sd(leaf_re)
## [1] 2.984558
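Note that the standard deviation of the predicted random effects (~2.98) is smaller than the estimated random-effect standard deviation in the summary above (3.692). This is because the predicted leaf effects (BLUPs) are shrunk towards zero. As a rough illustration, assuming the standard shrinkage factor for a balanced random-intercept design with n observations per group:

```r
# Variance components from summary(tobacco_lmm1)
sigma2_leaf  <- 13.63   # between-leaf variance
sigma2_resid <- 14.46   # residual variance
n_per_leaf   <- 2       # observations per leaf

# Shrinkage factor: each leaf's predicted effect is roughly this
# fraction of its raw mean deviation from the grand mean
lambda <- sigma2_leaf / (sigma2_leaf + sigma2_resid / n_per_leaf)
lambda  # ~0.65: leaf effects are pulled about 35% towards zero
```

Shrinkage is strongest when groups have few observations or when the residual variance dominates the between-group variance.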
To get pseudo-r^2:
MuMIn::r.squaredGLMM(tobacco_lmm1)
## R2m R2c
## [1,] 0.3707882 0.6761025
The first is the marginal R² (fixed effects only); the second is the conditional R² (fixed and random effects combined).
We can also use the performance package:
performance::r2_nakagawa(tobacco_lmm1)
## # R2 for Mixed Models
##
## Conditional R2: 0.676
## Marginal R2: 0.371
tobacco_lmm1 %>% emmeans(~treat) %>%
as.data.frame() %>%
rename(nlegion = emmean, lwr = lower.CL, upr = upper.CL) %>%
ggplot(aes(x = treat, y = nlegion)) +
geom_pointrange(aes(ymin = lwr, ymax = upr)) +
geom_jitter(data = tobacco, col = "red", height = 0, width = 0.1) +
labs(x = "Treatment", y = "Number of lesions")